White Wine Exploration by Wei Zhang

This report explores a dataset containing the attributes of 4898 white wines.

Univariate Plots Section

## 'data.frame':    4898 obs. of  13 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
##  $ quality.levels      : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##                                                                     
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##                                                                   
##     quality      quality.levels
##  Min.   :3.000   3:  20        
##  1st Qu.:5.000   4: 163        
##  Median :6.000   5:1457        
##  Mean   :5.878   6:2198        
##  3rd Qu.:6.000   7: 880        
##  Max.   :9.000   8: 175        
##                  9:   5

Our dataset consists of 4898 observations of 12 original variables. The 13th column in the output above, quality.levels, is a factor version of quality that I added.

Note: quality is scored on a 0-10 scale (the observed scores run from 3 to 9), so I used a binwidth of 0.5 for the histogram to show the distribution clearly.

The distribution is quite clear: most wines are rated between 5 and 7. Which ingredients influence quality the most? Does more or less of them help? What makes a wine rate as very good?
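As a quick check of the distribution described above, the counts per quality score can be tallied directly. This is a minimal Python sketch rather than the code used for the report: the counts are copied from the quality.levels column of the summary output, and the variable names are my own.

```python
# Counts per quality score, copied from the quality.levels summary above.
quality_counts = {3: 20, 4: 163, 5: 1457, 6: 2198, 7: 880, 8: 175, 9: 5}

total = sum(quality_counts.values())

# Share of wines rated 5, 6 or 7.
mid_share = sum(quality_counts[q] for q in (5, 6, 7)) / total

# Mean quality, weighted by the number of wines at each score.
mean_quality = sum(q * n for q, n in quality_counts.items()) / total

print(total)                   # 4898
print(round(mid_share, 3))     # 0.926
print(round(mean_quality, 3))  # 5.878
```

The weighted mean agrees with the mean of 5.878 reported in the summary, and roughly 93% of the wines fall in the 5-7 range.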

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   3.800   6.300   6.800   6.855   7.300  14.200

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0800  0.2100  0.2600  0.2782  0.3200  1.1000

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.2700  0.3200  0.3342  0.3900  1.6600

Most wines fall within a fairly narrow range of fixed.acidity, volatile.acidity and citric.acid. citric.acid also shows an unusual spike at around 0.5.

##    Mode   FALSE    TRUE 
## logical    4879      19

We can see that 19 wines contain no citric acid at all. Does that influence the quality of the wine or not?

## 
##  0.6  0.7  0.8  0.9 0.95    1 1.05  1.1 1.15  1.2 1.25  1.3 1.35  1.4 1.45 
##    2    7   25   39    4   93    1  146    3  187    3  147    2  184    4 
##  1.5 1.55  1.6 1.65  1.7 1.75  1.8 1.85  1.9 1.95 
##  142    2  165    2   99    1   99    3   59    2

Among the values in the table above (residual sugar below 2 g/dm^3), the most common levels are 1.1 to 1.6 g/dm^3, with 1.2 and 1.4 the most frequent. But is this the best choice for wine?
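The most frequent residual sugar values can be read off the frequency table programmatically. The following is a small Python sketch under the assumption that the table above lists every residual.sugar value below 2; the counts are copied from that output and the names are illustrative.

```python
# Frequency table for residual.sugar values below 2, copied from the output above.
sugar_counts = {
    0.6: 2, 0.7: 7, 0.8: 25, 0.9: 39, 0.95: 4, 1.0: 93, 1.05: 1,
    1.1: 146, 1.15: 3, 1.2: 187, 1.25: 3, 1.3: 147, 1.35: 2,
    1.4: 184, 1.45: 4, 1.5: 142, 1.55: 2, 1.6: 165, 1.65: 2,
    1.7: 99, 1.75: 1, 1.8: 99, 1.85: 3, 1.9: 59, 1.95: 2,
}

# Sort the values by how many wines have them, most frequent first.
ranked = sorted(sugar_counts.items(), key=lambda item: item[1], reverse=True)
top_values = [value for value, _ in ranked[:6]]

print(top_values)                  # [1.2, 1.4, 1.6, 1.3, 1.1, 1.5]
print(sum(sugar_counts.values()))  # 1421
```

The table covers 1421 of the 4898 wines, so these peaks describe the low-sugar wines only, not the whole dataset.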

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Chlorides has a long tail, but the majority of values lie between 0.036 and 0.05, with a mean of 0.04577.

In this section, I introduced a new attribute, ratio.free.sulfur.dioxide. For most wines the ratio of free sulfur dioxide is about 0.19 to 0.32, and the free sulfur dioxide is about 23 to 46 mg/dm^3.
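The report does not show how ratio.free.sulfur.dioxide is computed. A plausible reading, and it is only an assumption, is free.sulfur.dioxide divided by total.sulfur.dioxide. The Python sketch below applies that definition to the first ten rows shown in the str() output; the variable names are my own.

```python
# First ten rows of the two sulfur dioxide columns, copied from the str() output above.
free_so2 = [45, 14, 30, 47, 47, 30, 30, 45, 14, 28]
total_so2 = [170, 132, 97, 186, 186, 97, 136, 170, 132, 129]

# Assumption: ratio.free.sulfur.dioxide = free.sulfur.dioxide / total.sulfur.dioxide.
ratio_free_so2 = [free / total for free, total in zip(free_so2, total_so2)]

for ratio in ratio_free_so2:
    print(round(ratio, 3))
```

Because free sulfur dioxide is part of the total, a ratio defined this way always lies between 0 and 1.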

Apart from alcohol, the distributions of density, pH and sulphates look roughly normal. We will look at the relationships between them next.

Univariate Analysis

What is the structure of your dataset?

There are 4898 white wines in the dataset with 12 features (“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates”, “alcohol”, “quality”).

Acidity-related features: “fixed.acidity”, “volatile.acidity”, “citric.acid”, “pH”;

Sulfur-dioxide-related features: “free.sulfur.dioxide”, “total.sulfur.dioxide”, “sulphates”;

Density-related features: “residual.sugar”, “chlorides”, “alcohol”.

What is/are the main feature(s) of interest in your dataset?

The main feature is quality. I want to find out which factors contribute the most to the quality of a white wine.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

“fixed.acidity”, “volatile.acidity”, “citric.acid”, “residual.sugar”, “chlorides”, “free.sulfur.dioxide”, “total.sulfur.dioxide”, “density”, “pH”, “sulphates” and “alcohol” will help support my investigation into quality.

Did you create any new variables from existing variables in the dataset?

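I introduced quality.levels, which stores quality as a factor, and ratio.free.sulfur.dioxide, the ratio of free sulfur dioxide used in the univariate plots.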

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

I did not perform any operations on the data at this stage.

Bivariate Plots Section

##                      fixed.acidity volatile.acidity  citric.acid
## fixed.acidity           1.00000000      -0.02269729  0.289180698
## volatile.acidity       -0.02269729       1.00000000 -0.149471811
## citric.acid             0.28918070      -0.14947181  1.000000000
## residual.sugar          0.08902070       0.06428606  0.094211624
## chlorides               0.02308564       0.07051157  0.114364448
## free.sulfur.dioxide    -0.04939586      -0.09701194  0.094077221
## total.sulfur.dioxide    0.09106976       0.08926050  0.121130798
## density                 0.26533101       0.02711385  0.149502571
## pH                     -0.42585829      -0.03191537 -0.163748211
## sulphates              -0.01714299      -0.03572815  0.062330940
## alcohol                -0.12088112       0.06771794 -0.075728730
## quality                -0.11366283      -0.19472297 -0.009209091
##                      residual.sugar   chlorides free.sulfur.dioxide
## fixed.acidity            0.08902070  0.02308564       -0.0493958591
## volatile.acidity         0.06428606  0.07051157       -0.0970119393
## citric.acid              0.09421162  0.11436445        0.0940772210
## residual.sugar           1.00000000  0.08868454        0.2990983537
## chlorides                0.08868454  1.00000000        0.1013923521
## free.sulfur.dioxide      0.29909835  0.10139235        1.0000000000
## total.sulfur.dioxide     0.40143931  0.19891030        0.6155009650
## density                  0.83896645  0.25721132        0.2942104109
## pH                      -0.19413345 -0.09043946       -0.0006177961
## sulphates               -0.02666437  0.01676288        0.0592172458
## alcohol                 -0.45063122 -0.36018871       -0.2501039415
## quality                 -0.09757683 -0.20993441        0.0081580671
##                      total.sulfur.dioxide     density            pH
## fixed.acidity                 0.091069756  0.26533101 -0.4258582910
## volatile.acidity              0.089260504  0.02711385 -0.0319153683
## citric.acid                   0.121130798  0.14950257 -0.1637482114
## residual.sugar                0.401439311  0.83896645 -0.1941334540
## chlorides                     0.198910300  0.25721132 -0.0904394560
## free.sulfur.dioxide           0.615500965  0.29421041 -0.0006177961
## total.sulfur.dioxide          1.000000000  0.52988132  0.0023209718
## density                       0.529881324  1.00000000 -0.0935914935
## pH                            0.002320972 -0.09359149  1.0000000000
## sulphates                     0.134562367  0.07449315  0.1559514973
## alcohol                      -0.448892102 -0.78013762  0.1214320987
## quality                      -0.174737218 -0.30712331  0.0994272457
##                        sulphates     alcohol      quality
## fixed.acidity        -0.01714299 -0.12088112 -0.113662831
## volatile.acidity     -0.03572815  0.06771794 -0.194722969
## citric.acid           0.06233094 -0.07572873 -0.009209091
## residual.sugar       -0.02666437 -0.45063122 -0.097576829
## chlorides             0.01676288 -0.36018871 -0.209934411
## free.sulfur.dioxide   0.05921725 -0.25010394  0.008158067
## total.sulfur.dioxide  0.13456237 -0.44889210 -0.174737218
## density               0.07449315 -0.78013762 -0.307123313
## pH                    0.15595150  0.12143210  0.099427246
## sulphates             1.00000000 -0.01743277  0.053677877
## alcohol              -0.01743277  1.00000000  0.435574715
## quality               0.05367788  0.43557472  1.000000000
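Each entry in the matrix above is a correlation coefficient between two columns, presumably Pearson's r (the default of R's cor(), though the report does not say which method was used). As a minimal sketch of that calculation, the Python function below computes Pearson's r for two columns; the sample values are the first ten rows from the str() output, and the function name is my own.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# First ten rows of alcohol and residual.sugar, copied from the str() output.
alcohol = [8.8, 9.5, 10.1, 9.9, 9.9, 10.1, 9.6, 8.8, 9.5, 11]
residual_sugar = [20.7, 1.6, 6.9, 8.5, 8.5, 6.9, 7, 20.7, 1.6, 1.5]

r = pearson(alcohol, residual_sugar)
print(r)
```

Ten rows will not reproduce the value of -0.45 in the matrix, which is based on all 4898 wines, but the sign comes out negative in this sample as well.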

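From the correlation matrix above, I pick out the following factors (correlation in parentheses):

quality: alcohol (0.44), density (-0.31), chlorides (-0.21), total.sulfur.dioxide (-0.17)

pH: fixed.acidity (-0.43), residual.sugar (-0.19), citric.acid (-0.16), density (-0.09)

density: residual.sugar (0.84), total.sulfur.dioxide (0.53), free.sulfur.dioxide (0.29)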

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Quality tends to decrease as density increases, but the effect is not large. A good wine should have a low density, around 0.98711 to 0.995.

Alcohol content and density both decrease while quality is below 5, and both increase once quality is above 5.

A good wine typically has an alcohol content of 11.5% to 12.9%.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    34.0   101.0   122.0   125.2   146.0   229.0

In this section we can see that good white wines have a total.sulfur.dioxide of about 125.2 mg/dm^3 on average.

Although we cannot be sure that free.sulfur.dioxide really influences the quality of white wine, we can narrow it down to a range of about 30-50 mg/dm^3.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.01200 0.03100 0.03700 0.03816 0.04400 0.13500

A good wine has only a small amount of salt (chlorides), roughly 0.031 to 0.044 g/dm^3.

Keeping a good pH is necessary; the range is about 3.0 to 3.4.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality is influenced the most by the wine’s density, alcohol content, chlorides and total.sulfur.dioxide. A good white wine should have a low density, a high alcohol content, a total.sulfur.dioxide of about 125 mg/dm^3, and only a little chlorides.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

Although we did not find a strong relationship between pH and quality, pH is related to fixed.acidity, citric.acid, residual.sugar and density.

Looking more closely at density, its strongest factors are residual.sugar and sulfur dioxide. The factors that influence quality also influence each other.

What was the strongest relationship you found?

With respect to quality, the strongest relationship is with alcohol (0.44), the second is density (-0.31), and the third is chlorides (-0.21).
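This ranking can be checked by sorting the quality column of the correlation matrix by absolute value. The following Python sketch does that; the coefficients are copied from the matrix in the Bivariate Plots Section, and the variable names are illustrative.

```python
# Correlation of each feature with quality, copied from the matrix above.
quality_corr = {
    "fixed.acidity": -0.113662831,
    "volatile.acidity": -0.194722969,
    "citric.acid": -0.009209091,
    "residual.sugar": -0.097576829,
    "chlorides": -0.209934411,
    "free.sulfur.dioxide": 0.008158067,
    "total.sulfur.dioxide": -0.174737218,
    "density": -0.307123313,
    "pH": 0.099427246,
    "sulphates": 0.053677877,
    "alcohol": 0.435574715,
}

# Rank features by the strength of the correlation, ignoring its sign.
ranked = sorted(quality_corr, key=lambda name: abs(quality_corr[name]), reverse=True)

print(ranked[:3])  # ['alcohol', 'density', 'chlorides']
```

Even the strongest of these correlations is moderate at best, so no single feature explains quality on its own.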

Multivariate Plots Section

residual.sugar is correlated with density; the points along the green line seem to be better wines than the others.

pH is correlated with fixed.acidity. The most popular white wines sit close to the mean of both pH and fixed.acidity.

Likewise for citric acid: wines near the mean citric.acid value tend to be rated higher.

A wine with a high alcohol content and a low density can be a very tasty wine.

This part is quite interesting. Just as some people take salt with a shot, some people like a little salt in their wine, but the majority prefer not to have it.

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

pH is influenced by acid and sugar. The plots suggest that wines with acid levels near the mean tend to be rated well, and the best-rated wines always have a pH of around 3.0-3.4, as seen in the bivariate plots.

Density is strongly influenced by sugar. People like white wine with a low density, but with some sugar as well.

Were there any interesting or surprising interactions between features?

Chlorides vs. alcohol is quite interesting. Just as some people take salt with a shot, some people like a little salt in their wine, but the majority prefer not to have it.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

The model was very weak (the r-value was only 0.28), so I decided to delete it. Every expert has their own taste, and with so little data we cannot build an accurate model of the real quality of a wine, only an approximate one. We can still draw some statistical conclusions about what a good white wine looks like.


Final Plots and Summary

Plot One

Description One

We can see that the majority of white wines in this dataset are rated 5 or 6. Because very few wines are rated 8 or 9, the dataset is not a convincing basis for analyzing high-quality wines; it is better suited to wines rated 5 to 8.

Plot Two

Description Two

This plot contains three variables: density, alcohol and chlorides. These three contribute the most to the rating of a white wine.

Plot Three

Description Three

This plot shows the relationship between alcohol, density and quality. The good white wines all gather around the mean density, with an alcohol content of around 10-13%. Still, taste depends on the individual.


Reflection

From this dataset, we can work out that the best choice for a white wine has the following attributes:

alcohol content recommended to be 11.5% to 12.9%

density recommended to be 0.98711 to 0.995

chlorides recommended to be 0.031 to 0.044 g/dm^3

pH recommended to be 3.0 to 3.4

free.sulfur.dioxide recommended to be 30-50 mg/dm^3

total.sulfur.dioxide recommended to be about 125.2 mg/dm^3

These figures can help you pick a wine that is nice to drink.
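The recommended ranges above can be turned into a simple checklist. The Python sketch below is one hypothetical way to do that: the RECOMMENDED table and the out_of_range helper are my own names, total.sulfur.dioxide is left out because it is given as a single value rather than a range, and the sample wine is the first row of the str() output (density as displayed there).

```python
# Recommended ranges from the Reflection list (inclusive bounds).
RECOMMENDED = {
    "alcohol": (11.5, 12.9),
    "density": (0.98711, 0.995),
    "chlorides": (0.031, 0.044),
    "pH": (3.0, 3.4),
    "free.sulfur.dioxide": (30, 50),
}


def out_of_range(wine):
    """Return the names of the attributes that fall outside the recommended ranges."""
    return [
        name
        for name, (low, high) in RECOMMENDED.items()
        if not low <= wine[name] <= high
    ]


# First wine in the dataset, values copied from the str() output (quality 6).
first_wine = {
    "alcohol": 8.8,
    "density": 1.001,
    "chlorides": 0.045,
    "pH": 3.0,
    "free.sulfur.dioxide": 45,
}

print(out_of_range(first_wine))  # ['alcohol', 'density', 'chlorides']
```

A wine rated 6 misses three of the five ranges here, which fits the idea that the ranges describe the better-rated wines rather than the typical one.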

Conclusion

The dataset shows that a good wine needs the right amounts of acid, sugar, salt and sulfur dioxide. These figures can only serve as recommendations.

Because there is so little data for quality 8-9 (only 180 white wines), we cannot build an accurate model to analyze the result.

For future work, we should gather more data on high-quality white wines in order to find a reliable way to judge a good white wine.